Skip to content

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425

Merged
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector
Jun 30, 2026
Merged

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector

Conversation

@qxxaa

@qxxaa qxxaa commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Problem

The consolidator creates and updates observations in memory_units without populating the search_vector column. Under the native text search extension backend, this makes all observations invisible to BM25 full-text retrieval.

This is a known gap documented in the source code itself:

# Native: the migration p4q5r6s7t8u9 dropped the GENERATED expression on
# search_vector to allow per-deployment language configuration; the
# batch insert path in ops_postgresql.insert_facts_batch now populates
# it via to_tsvector($lang, ...). This single-observation INSERT does
# not, so observations under the native backend currently land with
# NULL search_vector and are not BM25-searchable until reflected/
# re-ingested. Tracking a separate fix for that gap.

This PR fills that gap.

The raw fact batch insert path (ops_postgresql.insert_facts_batch) correctly populates search_vector with to_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:

Parallel retrieval: semantic=12, bm25=0, graph=95, temporal=1

After the fix:

Parallel retrieval: semantic=12, bm25=3, graph=95, temporal=1

Root cause

Four code paths in consolidator.py write observation text without populating search_vector:

Function Operation Trigger
_dedup_reconcile_create UPDATE New observation merges into near-twin
_dedup_reconcile_update UPDATE Updated observation drifts into different twin
_execute_update_action UPDATE LLM rewrites existing observation
_create_observation_directly INSERT New observation created

Fix

Add conditional search_vector population to all four sites, gated on config.text_search_extension == "native". This matches the existing pattern in ops_postgresql.insert_facts_batch.

For UPDATE paths (Sites 1-3), a search_vector_clause string is conditionally appended to the SET clause:

search_vector_clause = (
    f",\n            search_vector = to_tsvector('{config.text_search_extension_native_language}'::regconfig, COALESCE($1, ''))"
    if config.text_search_extension == "native"
    else ""
)

For the INSERT path (Site 4), the existing else branch (which bundled native + pg_textsearch + pgroonga + pg_search) is split into elif "native" (with tsvector) and else (without), ensuring non-native backends continue to leave search_vector NULL as they index base text columns directly.

Backend gating rationale

The fix is restricted to text_search_extension == "native" because:

  • vchord: Has its own INSERT branch with tokenize() - already handles search_vector
  • native: Uses PostgreSQL's built-in tsvector/GIN for BM25 - needs explicit to_tsvector() population
  • pg_textsearch / pgroonga / pg_search: Index the raw text column directly via their own extension mechanisms - search_vector is a dummy column that should remain NULL

tsvector content: observation text only

The tsvector is built from COALESCE(text, '') - the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:

  • Semantic retrieval already handles meaning-based discovery (embeddings)
  • Graph retrieval already handles entity-based discovery (entity links)
  • Temporal retrieval already handles time-based discovery (occurred_start/end, mentioned_at)

BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.

Regression risk

None. The fix only fires when text_search_extension == "native" - all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern in insert_facts_batch. No new parameters are added to any SQL query.

Testing

Added test_create_observation_populates_search_vector_native in tests/test_consolidation.py. Asserts that observations created via consolidation have search_vector IS NOT NULL under the native text search backend (Site 4 - the INSERT path, highest frequency).

The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.

Additionally verified on a live deployment with ~139 observations:

  • Pre-patch: all observations have search_vector IS NULL, BM25 returns 0
  • Post-patch: new observations created by consolidation have populated tsvectors
  • BM25 arm returns relevant candidates (3+ observations matching keyword queries)
  • Semantic, graph, and temporal arms unaffected

qxxaa added 2 commits June 26, 2026 16:24
The consolidator creates and updates observations without populating the
search_vector tsvector column. Under the native text search extension,
this means observations are invisible to BM25 full-text retrieval - the
BM25 arm returns 0 candidates regardless of query content.

Four code paths write observation text to memory_units:
1. _dedup_reconcile_create (merge into existing twin)
2. _dedup_reconcile_update (drift-merge into different twin)
3. _execute_update_action (LLM rewrite of existing observation)
4. _create_observation_directly (new observation INSERT)

None populated search_vector. This patch adds conditional tsvector
generation gated on config.text_search_extension == 'native', matching
the existing pattern in ops_postgresql.insert_facts_batch. Non-native
backends (pg_textsearch, pgroonga, pg_search) continue to leave
search_vector NULL as they index base text columns directly.

The INSERT path (Site 4) splits the existing else branch into an
explicit elif/else to avoid applying native tsvector logic to backends
that don't use it.

Fixes: observations invisible to BM25 retrieval arm.
Add test for observation creation with native search vector
@qxxaa

qxxaa commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Deployment note: Existing observations created before this fix still have search_vector IS NULL. A one-time backfill is needed for full BM25 coverage:

UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;

Suggest including this in the release notes. Note: the regconfig language should match the deployment's HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE setting. This only applies to deployments using text_search_extension = "native" and the query can be expensive on large tables, so operators should schedule it during low traffic.

For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers.

qxxaa and others added 3 commits June 26, 2026 16:46
…ations

The writer fix only populates search_vector for observations created or
updated after deploy. Observations already written under the native
backend keep a NULL search_vector and stay invisible to BM25 until
re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the
native tsvector column type and scoped to fact_type='observation' with a
NULL search_vector (idempotent). Matches the writer's text-only tsvector
and the configured native language.
post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked
up from the contributor's local hookspath and committed by mistake. They
are unrelated to this change; the project's real .githooks/pre-commit is
left intact.
@nicoloboschi nicoloboschi merged commit 21176f8 into vectorize-io:main Jun 30, 2026
87 checks passed
nicoloboschi added a commit that referenced this pull request Jul 1, 2026
…nfig, trace recorder leak) (#2491)

* test(consolidation): fix dedup merge-path tests missing text-search config

The dedup merge/update path builds a search_vector UPDATE clause from
config.text_search_extension (+ _native_language) since #2425, but the
_dedup_reconcile_create / _dedup_reconcile_update test configs only set
consolidation_dedup_threshold, so the two merge-path tests raised
AttributeError: 'types.SimpleNamespace' object has no attribute
'text_search_extension' on main.

Add the two fields (production defaults native/english) to those configs.
The clause reuses $1, so the existing positional-arg assertions are unchanged.

* test(fact-extraction): pass Vertex AI settings when building LLMConfig

Regression: LLMConfig was refactored to use vertexai_project_id/region/
service_account_key as-passed (the caller resolves the global-config fallback),
but the llm_config fixture never forwarded them. So with the CI provider set to
vertexai, LLMConfig raised "HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID is required"
even though the env var was set — the test errored in the test-api job.

Forward the three Vertex settings from config (mirroring MemoryEngine's own
LLMConfig construction). Verified: LLMConfig(provider="vertexai", ...) now
constructs with project_id passed, and still raises when it is omitted.

* test(llm-provider): forward provider-specific settings in _make_llm

_make_llm() built an LLMProvider from the env-selected provider without
forwarding provider-specific settings, so the vertexai and litellmrouter
acceptance-matrix jobs failed at construction ("VERTEXAI_PROJECT_ID is required"
/ "litellmrouter requires a config object"). LLMProvider uses these as-passed
(it does not resolve them from global config), so forward
vertexai_project_id/region/service_account_key and litellmrouter_config.

* test(llm-trace): drop leaked span recorders after each test (#2229)

Root cause of the flaky test_llm_trace::test_disabled_writes_no_rows:
MemoryEngine.__init__ registers its LLM-trace recorder in a process-global
registry, and only close() removes it. Tests that construct an engine directly
(test_per_operation_llm_config, test_llm_reasoning_effort_env, etc.) never
close it, leaking an ENABLED recorder. Locally those recorders' writes fail
(uninitialized backend), but in CI a leaked recorder with a live backend
records a later test's LLM calls into the shared DB — so test_disabled_writes_
no_rows sees rows for its bank even though its own recorder is disabled
(assert N == 0). Reproduced: after test_per_operation_llm_config the registry
holds 8 enabled recorders.

Add an autouse fixture that snapshots the registry and removes anything a test
leaked. Verified: the registry drops from 8 leaked recorders back to 1.
_teardown_memory_engine already guards the fixtures; this guards direct
constructions. 53 trace + leak-risk tests pass together under -n2.
oliverv pushed a commit to oliverv/collabmind-hindsight that referenced this pull request Jul 1, 2026
…TE in consolidator (vectorize-io#2425)

* fix: populate search_vector on observation INSERT/UPDATE in consolidator

The consolidator creates and updates observations without populating the
search_vector tsvector column. Under the native text search extension,
this means observations are invisible to BM25 full-text retrieval - the
BM25 arm returns 0 candidates regardless of query content.

Four code paths write observation text to memory_units:
1. _dedup_reconcile_create (merge into existing twin)
2. _dedup_reconcile_update (drift-merge into different twin)
3. _execute_update_action (LLM rewrite of existing observation)
4. _create_observation_directly (new observation INSERT)

None populated search_vector. This patch adds conditional tsvector
generation gated on config.text_search_extension == 'native', matching
the existing pattern in ops_postgresql.insert_facts_batch. Non-native
backends (pg_textsearch, pgroonga, pg_search) continue to leave
search_vector NULL as they index base text columns directly.

The INSERT path (Site 4) splits the existing else branch into an
explicit elif/else to avoid applying native tsvector logic to backends
that don't use it.

Fixes: observations invisible to BM25 retrieval arm.

* Implement test for search vector population in observations

Add test for observation creation with native search vector

* style: run lint

* fix(consolidation): backfill search_vector for existing native observations

The writer fix only populates search_vector for observations created or
updated after deploy. Observations already written under the native
backend keep a NULL search_vector and stay invisible to BM25 until
re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the
native tsvector column type and scoped to fact_type='observation' with a
NULL search_vector (idempotent). Matches the writer's text-only tsvector
and the configured native language.

* chore: remove accidentally committed git-lfs hooks

post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked
up from the contributor's local hookspath and committed by mistake. They
are unrelated to this change; the project's real .githooks/pre-commit is
left intact.

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants